Main Questions
Question 1 & 2
Read the data as a pandas DataFrame. Filter the data to
include only rows where Year is 1962 and then make a scatter plot
comparing CO2 emissions (metric tons per capita) and
gdpPercap for the filtered data.
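library(tidyverse)   # dplyr verbs and ggplot2
library(broom)       # tidy()
library(kableExtra)  # kbl(), kable_styling()
library(plotly)      # plot_ly(), used in Question 5

# Scatter plot of the 1962 subset on the original scale
df %>%
  filter(Year == 1962) %>%
  ggplot(aes(y = co2PerCap, x = gdpPercap)) +
  theme_classic() +
  geom_point(color = "red") +
  labs(y = "CO2 emissions (metric tons per capita)", x = "GDP in purchasing power parity (USD per capita)") +
  ggtitle("GDP vs. CO2 emissions in 1962")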

# Same scatter plot with both axes on a log10 scale
df %>%
  filter(Year == 1962) %>%
  ggplot(aes(y = co2PerCap, x = gdpPercap)) +
  theme_classic() +
  scale_y_log10() +
  scale_x_log10() +
  geom_point(color = "red") +
  ggtitle("log GDP vs. log CO2 emissions in 1962") +
  xlab("log GDP in purchasing power parity (USD per capita)") +
  ylab("log CO2 emissions (metric tons per capita)")
After visualizing the original data, we see a few large values that lie
far from the bulk of the observations, which cluster close together at
small values. It appears that as GDP per capita increases, CO2 emissions
increase at a faster rate, up until GDP per capita reaches about
200,000. We cannot determine whether the relationship between the x and
y values is linear just by looking at this plot.
However, because both the x and y values span a large range of
magnitudes, we log-transform both x (GDP) and y (CO2).
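As a rough illustration of why the log-log transform helps, the sketch below uses synthetic data (not the dataset analysed here; all names and constants are made up). A power-law relationship y = a * x^b becomes a straight line with slope b once both variables are log10-transformed.

```r
# Synthetic data only: a power law y = a * x^b with multiplicative noise.
set.seed(1)
x <- 10^runif(200, min = 2, max = 5)            # spans three orders of magnitude
y <- 0.001 * x^1.2 * 10^rnorm(200, sd = 0.1)    # assumed exponent b = 1.2

r_raw <- cor(x, y)                  # Pearson r on the original scale
r_log <- cor(log10(x), log10(y))    # Pearson r after the log-log transform

# On the log-log scale the slope of a linear fit estimates the exponent b.
fit <- lm(log10(y) ~ log10(x))
slope <- unname(coef(fit)[2])

print(c(r_raw = r_raw, r_log = r_log, slope = slope))
```

On the log scale the fitted slope recovers the assumed exponent closely, which is the sense in which the transformed relationship is linear.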
Question 3
On the filtered data, calculate the Pearson correlation of
CO2 emissions (metric tons per capita) and
gdpPercap. What is the Pearson R value and associated p
value?
# Use the same base (log10) for both variables; Pearson's r does not depend on the base
df <- df %>%
  mutate(logCO2 = log10(co2PerCap), logGDP = log10(gdpPercap))
mod <- cor.test(x = df$logCO2, y = df$logGDP) %>% tidy()
mod %>%
  kbl() %>%
  kable_styling()
| estimate | statistic | p.value | parameter | conf.low | conf.high | method | alternative |
|---|---|---|---|---|---|---|---|
| 0.9018728 | 71.77423 | 0 | 1182 | 0.8906647 | 0.9119854 | Pearson’s product-moment correlation | two.sided |
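Pearson’s correlation coefficient indicates the strength of the linear
relationship between the two variables. logGDP is
positively associated with logCO2 at r = 0.90 (95% CI 0.89 to 0.91). The
p-value is displayed as 0, meaning it is smaller than the table's display
precision. Because df is not filtered before cor.test is called, this
estimate uses all years (1,182 degrees of freedom, i.e. 1,184 complete
pairs), not only the 1962 subset.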
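Pearson's r is unchanged by any positive linear rescaling of either variable, so the base of the logarithm does not affect it. The sketch below checks this on synthetic data (the vectors `gdp` and `co2` are invented for illustration) and also shows where the degrees of freedom reported by `cor.test` come from.

```r
# Synthetic, strictly positive data; not the dataset analysed in this report.
set.seed(2)
gdp <- 10^runif(100, min = 2, max = 5)
co2 <- 0.001 * gdp * 10^rnorm(100, sd = 0.3)

res_10 <- cor.test(log10(co2), log10(gdp))   # both variables in base 10
res_e  <- cor.test(log10(co2), log(gdp))     # base 10 vs. natural log

# log(x) = log10(x) * log(10) is a positive rescaling, so r is identical.
same_r <- isTRUE(all.equal(unname(res_10$estimate), unname(res_e$estimate)))
print(same_r)

# Degrees of freedom are n - 2, where n is the number of complete pairs.
print(res_10$parameter)
```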
Question 4
In what year is the correlation between
CO2 emissions (metric tons per capita) and
gdpPercap the strongest?
# Kendall's tau per year, then keep the year with the largest estimate
res <- df %>%
  group_by(Year) %>%
  summarise(
    tidy(
      cor.test(x = co2PerCap, y = gdpPercap, method = "kendall")
    )
  ) %>%
  dplyr::slice_max(estimate, n = 1)
res %>%
  kbl() %>%
  kable_styling()
| Year | estimate | statistic | p.value | method | alternative |
|---|---|---|---|---|---|
| 2002 | 0.780129 | 12.90234 | 0 | Kendall’s rank correlation tau | two.sided |
Kendall’s tau was used because the two variables are not
normally distributed, as we saw when plotting them in Question 1.
The Kendall’s tau correlation between CO2 emissions and GDP
per capita is highest in 2002, at
τ = 0.78.
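A small sketch of why a rank correlation suits skewed, nonlinear data, using synthetic values (not the report's dataset): Kendall's tau depends only on the ordering of the observations, so it equals 1 for any strictly increasing relationship and is unchanged by log-transforming the variables, whereas Pearson's r is pulled below 1 by the curvature.

```r
# Synthetic data only: a strictly increasing but strongly nonlinear relationship.
set.seed(3)
x <- rexp(150)          # right-skewed predictor
y <- exp(2 * x)         # monotone in x, far from linear

pearson     <- cor(x, y, method = "pearson")
kendall     <- cor(x, y, method = "kendall")
kendall_log <- cor(log(x), log(y), method = "kendall")  # same ordering, same tau

print(c(pearson = pearson, kendall = kendall, kendall_log = kendall_log))
```

This is also why the per-year Kendall test can be run on the untransformed variables: taking logs would not change tau.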
Question 5
Using plotly or bokeh, create an interactive scatter plot
comparing CO2 emissions (metric tons per capita) and
gdpPercap.
fig <- df %>%
  filter(Year == res$Year) %>%
  plot_ly(
    x = ~logGDP,
    y = ~logCO2,
    size = ~pop,
    color = ~continent,
    text = ~`Country Name`,
    hoverinfo = "text",
    type = "scatter",
    mode = "markers"
  )
# logGDP and logCO2 are already log-transformed, so both axes stay linear
fig %>%
  layout(
    title = paste0("log GDP vs. log CO2 emissions in ", res$Year),
    plot_bgcolor = "#e5ecf6",
    xaxis = list(title = "log GDP"),
    yaxis = list(title = "log CO2 Emissions"),
    legend = list(title = list(text = "<b> Continent </b>"))
  )
The interactive plot above depicts the relationship between CO2
emissions and GDP per capita in the year (2002) where the correlation
between the two variables is the highest as demonstrated in the question
above. Hovering over the dots displays the country names, and the dot
sizes correspond to the population size of that country.
More Questions
Question 1
What is the relationship between continent and
Energy use (kg of oil equivalent per capita)?
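# Formula interface: response ~ grouping factor.
# Passing the data frame itself as the first argument would instead treat
# every column as a separate sample.
res <- df %>%
  filter(!is.na(continent)) %>%
  kruskal.test(energyUsePerCap ~ continent, data = .) %>%
  tidy()
res %>%
  kbl() %>%
  kable_styling()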
| statistic | p.value | parameter | method |
|---|---|---|---|
| 12963.99 | 0 | 21 | Kruskal-Wallis rank sum test |
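We use the Kruskal-Wallis test because it is a non-parametric
alternative to one-way ANOVA. It does not assume normally distributed
residuals. The test works on two or more independent samples, which may
have different sizes.
The p-value is displayed as 0, which is below the significance
threshold we set at 0.05, so the test indicates a significant
relationship between continent and energy use. One caveat applies: the
table reports 21 degrees of freedom, which corresponds to 22 groups and
is more than the number of continents. This is the value kruskal.test
returns when a whole data frame is passed as its first argument, because
each column is then treated as a separate sample. With the formula
interface, kruskal.test(energyUsePerCap ~ continent, data = .), the
degrees of freedom equal the number of continents minus one, so the
statistic should be re-computed that way before relying on the
conclusion.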
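A minimal sketch of the Kruskal-Wallis test with the formula interface, on a synthetic data frame (`toy` and its values are invented; only the column names mirror the report). With a grouping factor of k levels, the reported degrees of freedom are k - 1.

```r
# Synthetic data only: three groups with clearly different locations.
set.seed(4)
toy <- data.frame(
  continent = rep(c("Africa", "Asia", "Europe"), each = 30),
  energyUsePerCap = c(
    rlnorm(30, meanlog = 6),
    rlnorm(30, meanlog = 7),
    rlnorm(30, meanlog = 8)
  )
)

# response ~ grouping factor; rows are assigned to groups by 'continent'
kw <- kruskal.test(energyUsePerCap ~ continent, data = toy)

print(kw$parameter)   # degrees of freedom = number of groups - 1
print(kw$p.value)
```

Checking that the degrees of freedom match the number of groups minus one is a quick way to confirm the test compared the intended groups.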
Question 2
Is there a significant difference between Europe and Asia with
respect to Imports of goods and services (% of GDP) in the
years after 1990?
# glm() with its default gaussian family fits an ordinary linear regression
mod <- df %>%
  filter(continent %in% c("Asia", "Europe"), Year > 1990) %>%
  glm(importPercentageGDP ~ continent, data = .) %>%
  tidy()
mod %>%
  kbl() %>%
  kable_styling()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 46.845311 | 2.613728 | 17.922793 | 0.0000000 |
| continentEurope | -5.056071 | 3.564314 | -1.418526 | 0.1575197 |
While there are many candidate statistical tests we could use to
compare a variable of interest between two groups, a
simple linear regression is chosen because it tells us to what extent
the regressor (continent) affects the regressand (imports of goods and
services as a % of GDP).
\[\begin{equation}
Y_i = \beta_0 + \beta_1 \, \text{Continent}_i + \epsilon_i
\end{equation}\]
The null hypothesis is \(\beta_{1} = 0\), where the variable Continent =
1 if Europe and 0 if Asia.
We fit a linear regression model to compare the two groups. There is
no significant difference between Europe and Asia with respect
to imports of goods and services as a percentage of
GDP (p = 0.158, which is above the 0.05 threshold).
A t-test would also have answered the question
above; linear regression has the additional advantage of telling
us how much the outcome variable (imports of goods and services as a %
of GDP) changes when moving from Asia = 0 to Europe = 1, which is
indicated by the beta weight, -5.06 percentage points.
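The link between the regression and a t-test can be sketched on synthetic data (the data frame `toy` and its values are invented for illustration). With a single two-level predictor, the slope equals the difference in group means, and its p-value matches that of a two-sample t-test with equal variances assumed.

```r
# Synthetic data only: two groups of 40 observations each.
set.seed(5)
toy <- data.frame(
  continent = rep(c("Asia", "Europe"), each = 40),
  importPercentageGDP = c(
    rnorm(40, mean = 45, sd = 20),
    rnorm(40, mean = 40, sd = 20)
  )
)

coefs <- summary(lm(importPercentageGDP ~ continent, data = toy))$coefficients
tt <- t.test(importPercentageGDP ~ continent, data = toy, var.equal = TRUE)

slope <- coefs["continentEurope", "Estimate"]      # Europe minus Asia
slope_p <- coefs["continentEurope", "Pr(>|t|)"]
group_means <- tapply(toy$importPercentageGDP, toy$continent, mean)
mean_diff <- unname(group_means["Europe"] - group_means["Asia"])

print(c(slope = slope, mean_diff = mean_diff))
print(c(lm_p = slope_p, t_test_p = tt$p.value))
```

Note that `t.test` without `var.equal = TRUE` runs Welch's test, whose p-value can differ slightly from the regression's.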
Question 3
What is the country (or countries) that has the highest
Population density (people per sq. km of land area) across
all years? (i.e., which country has the highest average ranking in this
category across each time point in the dataset?)
# Top 3 countries by population density within each year
df %>%
  select(Year, `Country Name`, popDensityPerSqKm) %>%
  arrange(Year, desc(popDensityPerSqKm)) %>%
  group_by(Year) %>%
  dplyr::slice_max(popDensityPerSqKm, n = 3) %>%
  ggplot(data = ., aes(x = as.factor(Year), y = popDensityPerSqKm, fill = as.factor(`Country Name`))) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_classic() +
  labs(x = "Year", y = "population density (per sq.km)", fill = "Country") +
  ggtitle("Population density in the top 3 highest density countries in Years 1962-2007")

# Rank countries within each year (1 = densest), then average ranks per country
res <- df %>%
  select(Year, `Country Name`, popDensityPerSqKm) %>%
  arrange(Year, desc(popDensityPerSqKm)) %>%
  group_by(Year) %>%
  mutate(
    rnks = row_number(desc(popDensityPerSqKm))
  ) %>%
  group_by(`Country Name`) %>%
  summarize(mean.rank = mean(rnks)) %>%
  slice_min(mean.rank, n = 3)
country1 <- res$`Country Name`[1]
country2 <- res$`Country Name`[2]
res %>%
  kbl() %>%
  kable_styling()
| Country Name | mean.rank |
|---|---|
| Macao SAR, China | 1.5 |
| Monaco | 1.5 |
| Hong Kong SAR, China | 3.1 |
The highest-ranked country in terms of population density changes
across the years, as the graph above shows.
To find out which country has the highest average ranking, we rank
countries by population density within each year and then average each
country's ranks across the years. Macao SAR, China and Monaco are tied
for first place because their average rank over the period
1962-2007 is the same, at 1.5.
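The averaging of within-year ranks can be sketched on a tiny synthetic example (countries "A", "B", "C" and their densities are invented, and the `country` column stands in for `Country Name`). Two countries that swap first and second place across two years end up tied on mean rank.

```r
# Synthetic data only: three hypothetical countries observed in two years.
dens <- data.frame(
  Year = rep(c(1962, 1967), each = 3),
  country = rep(c("A", "B", "C"), times = 2),
  popDensityPerSqKm = c(300, 200, 100, 200, 300, 100)
)

# Rank within each year; negating the density makes rank 1 the densest.
dens$rnk <- ave(-dens$popDensityPerSqKm, dens$Year, FUN = rank)

# Average each country's rank across years.
mean_rank <- tapply(dens$rnk, dens$country, mean)
print(mean_rank)   # A = 1.5, B = 1.5, C = 3
```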
Question 4
What country (or countries) has shown the greatest increase in
Life expectancy at birth, total (years) since 1962?
# Change in life expectancy between 1962 and 2007 per country, top 5
res <- df %>%
  select(Year, `Country Name`, `Life expectancy at birth, total (years)`) %>%
  group_by(`Country Name`) %>%
  summarise(
    diff = `Life expectancy at birth, total (years)`[Year == 2007] - `Life expectancy at birth, total (years)`[Year == 1962],
    .groups = "drop"
  ) %>%
  dplyr::slice_max(diff, n = 5)
res %>%
  kbl() %>%
  kable_styling()
| Country Name | diff |
|---|---|
| Maldives | 36.91615 |
| Bhutan | 33.19895 |
| Timor-Leste | 31.08515 |
| Tunisia | 30.86076 |
| Oman | 30.82310 |
res %>%
  ggplot(aes(x = reorder(`Country Name`, -diff), y = diff)) +
  geom_bar(position = "dodge", stat = "identity", fill = "lightblue") +
  theme_classic() +
  ggtitle("Increase in Life Expectancy in Years (Period: 1962-2007)") +
  ylab("Years") +
  xlab("Country") +
  geom_text(aes(label = round(diff, 2)), position = position_dodge(width = 0.9), vjust = -0.25)

From the graph above, we see that the top 5 countries that have shown
the greatest increase in life expectancy are Maldives, Bhutan,
Timor-Leste, Tunisia, and Oman.
This answer is based on the absolute difference in life expectancy
between 2007 and 1962.